The Cell DEep learning and COmputational DEscriptoR (DECODER) toolbox is a set of high-level Python modules for applying deep learning to image analysis and phenotypic profiling in cell biology. It is currently built on the Microsoft Cognitive Toolkit (CNTK). This toolbox is not meant to replace popular deep learning APIs, such as Keras, which provide extensive tools and support for building custom neural networks. Rather, Cell DECODER is aimed at cell biologists who wish to apply deep learning models trained on biological datasets to their own images with minimal additional training or customization.
There are five major stages to any learning algorithm. The Cell DECODER toolbox takes care of most of these processes with minimal user input required. These stages include: (1) reading and preparing the data, (2) preprocessing, (3) defining the model, (4) training, and (5) evaluating performance on held-out data.
There are multiple factors to consider within each stage, which we will not address here. For more information, we direct the reader to the following resources:
# cntk_imports.py
#
# Copyright (c) 2017 Jeffrey J. Nirschl
# All rights reserved
# All contributions by Jeffrey J. Nirschl in
# the laboratory of Erika Holzbaur at the
# University of Pennsylvania.
#
# Distributable under an MIT License
# ======================================================================
# Last updated JN 20170922
# Standard library imports
from __future__ import print_function
import os
import re
import sys
import time

# Third-party imports
import cv2
import numpy as np
import pandas as pd
from bokeh.io import output_notebook, show
import holoviews as hv
import matplotlib.pyplot as plt
# Import cell_decoder
import cell_decoder
# Configure plotting libraries
%matplotlib inline
output_notebook()
hv.notebook_extension('bokeh')
First, we will read our data in the form of a 'mapfile'. This is a comma- or tab-delimited text document that specifies the path to each image and the corresponding label, as shown below. Note that the labels are indexed from zero, not one.
/path/to/my/images/HeLa_001.png 0
/path/to/my/images/U2OS_002.png 1
/path/to/my/images/HeLa_002.png 0
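For illustration, a mapfile in this format can be written and inspected with pandas (a generic sketch using hypothetical paths; the toolbox provides its own readers):

```python
import pandas as pd

# A minimal, hypothetical tab-delimited mapfile: image path and zero-indexed label
with open('mapfile.tsv', 'w') as f:
    f.write('/path/to/my/images/HeLa_001.png\t0\n'
            '/path/to/my/images/U2OS_002.png\t1\n'
            '/path/to/my/images/HeLa_002.png\t0\n')

mapfile_df = pd.read_csv('mapfile.tsv', sep='\t', header=None,
                         names=['filepath', 'label'])
print(mapfile_df['label'].value_counts())  # counts images per class label
```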
If you don't have a mapfile yet, don't worry! Cell DECODER can create a mapfile from a directory of images.
my_directory = '/path/to/my/images/'
mapfile = cell_decoder.io.mapfile_utils.create_mapfile(my_directory)
Let's take a look at the Human Protein Atlas dataset, which was used to train Cell DECODER. The subset we use here contains 22 different cell lines, all stained for DNA (DAPI), microtubules (alpha-Tubulin DM1A), or the endoplasmic reticulum (calnexin/calreticulin).
import plotly.plotly as py
from plotly.offline import init_notebook_mode, plot, iplot # download_plotlyjs,
import plotly.figure_factory as ff
init_notebook_mode(connected=True)
filepath = os.path.join(os.path.dirname(cell_decoder.__file__), 'data', 'datasets',
'human_protein_atlas','cell_lines',
'HumanProteinAtlas_learn_cell_lines_info.csv')
df_cell_lines = pd.read_csv(filepath, header=0)
df_cell_lines.loc[0:21, :]  # .ix is deprecated; .loc displays all 22 cell lines
First, we create a new instance of a DataStruct, which will hold all of our information (mapfiles, labels, deep learning models, etc.). You can customize any of the following classes: DataStructParameters, TransformParameters, or LearningParameters, or you can use the default settings by leaving the parentheses empty. Notice that error checks are performed automatically to make sure everything goes smoothly.
# Initiate a DataStruct object, which contains all of the information
from cell_decoder.io import DataStruct
mapfile = os.path.normpath(os.path.join(cell_decoder.__file__, '../mapfiles/human_protein_atlas/cell_lines',
'all_cells_mapfile.tsv'))
data_struct = DataStruct(mapfile)
Let's take a look at some of these images from the Human Protein Atlas. The original images (2048x2048) were acquired on a confocal microscope using a 100x objective. We resized the images to ~32x magnification, or 250 nm per pixel (655x655). The network accepts a 224x224 image, and at this resolution a single random crop will capture cellular/microenvironment context while retaining the resolution to identify subcellular structures.
During training, the images are randomly cropped and resized using a scale factor drawn from the uniform range [0.4, 0.875], where 1 represents resizing the full 655x655 image to 224x224. A value of 0.4 represents taking a random 262x262 crop and then resizing it to 224x224. As we show later, training on images between 16-32x magnification does not limit application to that magnification range.
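The relationship between the scale factor and the crop size can be sketched as follows (a simplified illustration, not the toolbox's internal augmentation code):

```python
import random

def random_crop_size(full_side=655, scale_range=(0.4, 0.875)):
    """Side length of a random crop taken before resizing to the 224x224 input.

    A scale of 1.0 would use the full 655x655 field; the smallest scale,
    0.4, gives a crop of round(0.4 * 655) = 262 pixels per side.
    """
    scale = random.uniform(*scale_range)
    return round(full_side * scale)

crop = random_crop_size()  # e.g. somewhere between 262 and 573 pixels
```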
Each image is stained for DNA (DAPI), microtubules (alpha-tubulin, clone DM1A), and the endoplasmic reticulum (calnexin or calreticulin). Here, we resize the 655x655 image to 224x224 to visualize the entire field.
Image credit: Human Protein Atlas
# The DataStruct class can automatically read a random image from each class/label for quick visualization
hv_img = data_struct.plot_unique()
hv.Layout(hv_img[0:4]).cols(2)
Preprocessing data is an important step for learning algorithms. For images, this generally involves subtracting the mean (centering) and linearly normalizing the pixel intensity to a unit range [0-1] (scaling). Here, we will compute the image mean so that we can subtract it later. The data will be automatically scaled from [0-255] to [0-1] as each image is read.
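The centering and scaling steps described above can be sketched in NumPy (a generic illustration on a toy batch, independent of the toolbox internals):

```python
import numpy as np

# Toy batch of 8-bit RGB images: (n_images, height, width, channels)
images = np.random.randint(0, 256, size=(10, 224, 224, 3)).astype(np.float32)

scaled = images / 255.0            # scale pixel intensities from [0-255] to [0-1]
image_mean = scaled.mean(axis=0)   # per-pixel mean over all images
centered = scaled - image_mean     # subtract the mean (centering)
```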
# Compute image mean
image_mean = data_struct.compute_image_mean(data_aug=True, nargout=True, save_img=False)
# Display image mean
%opts RGB [xaxis=None yaxis=None]
hv.RGB(image_mean, label='Mean over all images')
Here, we will create a reader object that will read and shuffle images in our mapfile. We will feed the output to the network during training.
In the cell below, we create a model from our data_struct that contains the neural network architecture and the inputs/labels.
# Create reader
from cell_decoder.config import TransformParameters
train_params = TransformParameters()
train_mb_source = data_struct.create_mb_source(train_params,
is_training=True)
# Create model
model_dict = data_struct.create_model()
Here, we train the model for a pre-specified number of epochs (one epoch is a full pass through all images) or until a pre-defined stopping criterion is met. We will leave the settings at their defaults (100 epochs). Cell DECODER also automatically splits the data into train, test, and held-out datasets. During training, it performs cross-validation to get a sense of how well the algorithm will generalize. The held-out test dataset is never used during training.
# Train model
net = data_struct.train_model(model_dict)
One of the last steps in developing a new learning algorithm is to test the model performance on a held-out test set. The held-out test set represents a small sample of data that was randomly removed from the dataset before training, and which was never used during training. This is the gold standard for measuring generalization. If the model simply 'memorized' the training images, it would not perform well on this held-out test set.
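A held-out split of the kind described above can be sketched with NumPy (a generic illustration with an assumed 10% hold-out fraction, not the toolbox's internal splitting):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 1000
indices = rng.permutation(n_images)   # shuffle once, before any training

n_test = int(0.1 * n_images)          # hold out 10% (assumed fraction)
test_idx = indices[:n_test]           # never touched during training
train_idx = indices[n_test:]          # used for training/cross-validation
```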
TODO
Now we can apply our trained model to new images in order to predict the cell type. In itself, this may not be very useful to many biologists.
Alternatively, we can remove the final classification layer encoding the probability of cell type to leave a high-dimensional vector encoding object models. These object models are representations that the algorithm learned were useful to predict cell type. Our hypothesis is that these hierarchical representations may also be useful to discover phenotypes in different biological datasets.
The output of the truncated network is a vector with 2048 dimensions that represent features or object models useful to predict cell type in the next layer. In other words, the image cell type is encoded by a vector of 2048 numbers. This might seem like a large number, but if we consider each RGB pixel of the original image as a dimension, then the original image could be stretched out into a vector with 150528 dimensions (224x224x3).
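The dimension counts above are straightforward to verify:

```python
# Flattened RGB input image vs. the truncated network's feature vector
input_dims = 224 * 224 * 3   # 150528 dimensions
feature_dims = 2048

# The feature vector is roughly 73x more compact than the raw image
compression = input_dims / feature_dims
print(input_dims, compression)
```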
Since we live in a 3D world, it is difficult to think in higher-dimensional spaces such as 4, 5, or 6D, let alone 1000+ dimensions. However, the high-dimensional representation encodes useful information that can't be appreciated in 2 dimensions. In order to preserve the structure in high-dimensional space, we can project or embed the data into a lower-dimensional space in a way that retains some properties of the original data structure. This is known as dimensionality reduction. There are linear methods to reduce dimensionality (such as PCA) and nonlinear methods (such as Isomap, LLE, and tSNE). We use tSNE, an algorithm developed by Laurens van der Maaten in Geoff Hinton's laboratory, because it is particularly good for visualizing high-dimensional datasets.
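As a minimal sketch of the embedding step (using scikit-learn's tSNE on random stand-in data; the embeddings shown later in this notebook were computed separately):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the 2048-dimensional feature vectors (100 "images")
features = np.random.rand(100, 2048)

# Embed into 2 dimensions for visualization; perplexity must be < n_samples
embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(features)
print(embedded.shape)
```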
First, let's load the original feature vector from the final pooling layer (2048 vector).
filepath = os.path.join(os.path.dirname(cell_decoder.__file__), 'data',
'profiling','cells',
'Res50_81_cells_featvec.csv')
df_features = pd.read_csv(filepath, header=0)
df_labels = pd.read_csv(filepath.replace('featvec','labels'), header=None)
I show the first five rows of output below. Each row represents one held-out test image that the network has never 'seen' during training. Each column represents the output of a single hidden neuron in the final pooling layer, before predicting cell type. A separate column vector stores the cell type label, where each row has a number indicating the true cell type.
df_features = df_features.rename(columns={elem:'{0:d}'.format(elem) for elem in range(df_features.shape[1])})
df_features.head()
I reduced the dimensionality of this dataset using the tSNE algorithm. The tSNE algorithm reduces dimensionality in a way that preserves high-dimensional structure, and it does not use the true cell type label.
The table below shows the output of tSNE, where I reduced the original 2048-dimensional vector into 8 dimensions. I concatenated the cell type label to the array after dimensionality reduction.
filepath = os.path.join(os.path.dirname(cell_decoder.__file__), 'data',
'profiling','cells',
'Res50_81_cells_mapped_labels.csv')
df = pd.read_csv(filepath, header=None)
df = df.rename(columns={elem:'tsne_{0:d}'.format(elem) for elem in range(df.shape[1])})
df = df.rename(columns={'tsne_8':'label'})
df.head()
Let's visualize the first three dimensions using a 3D scatterplot.
fig = cell_decoder.visualize.plot.prepare_plotly(df)
py.iplot(fig, filename='Phenotypic Profiling')
If you use this for your research, please cite:
Nirschl JJ, Moore ASM, Holzbaur ELF. Cell DECODER: A Deep Learning Toolbox for Phenotypic Profiling in Biological Image Analysis.

The authors gratefully acknowledge the NVIDIA Corporation's Academic Hardware Grant of a Titan-X GPU. JJN was supported by NINDS F30-NS092227.
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.